feat: Expand benchmark, update params by Pringled · Pull Request #24 · MinishLab/semble

Pringled · 2026-04-17T09:13:29Z

The PR expands the benchmark from 29 repos (12 languages) to 66 repos (20 languages) for a total of 1318 queries. The main metric is also changed to the mean of per-language means to get a more balanced view of how well semble works across languages.

This means the old benchmark scores are no longer valid. A few parameters are also tuned.

Current dataset overview:

Language	Repos	Projects
bash	3	bash-it, bats-core, nvm
c	3	curl, libuv, redis
cpp	3	abseil-cpp, fmtlib, nlohmann-json
csharp	3	dapper, messagepack-csharp, newtonsoft-json
dart	3	dio, http-dart, riverpod
elixir	3	ecto, phoenix, plug
go	3	chi, cobra, gin
haskell	3	aeson, pandoc, xmonad
java	3	commons-lang, gson, jackson-databind
javascript	3	axios, express, redux
kotlin	3	exposed, kotlinx-coroutines, ktor
lua	3	lazy.nvim, mini.nvim, telescope.nvim
php	3	guzzle, laravel-framework, monolog
python	9	aiohttp, click, fastapi, flask, httpx, model2vec, pydantic, requests, starlette
ruby	3	rack, rails, sinatra
rust	3	axum, serde, tokio
scala	3	cats, circe, http4s
swift	3	alamofire, rxswift, vapor
typescript	3	trpc, vitest, zod
zig	3	zig, zig-clap, zls

Remove tasks where target files moved to external packages in newer versions (express v5 router/middleware, chi cors/redirect, ecto migration, phoenix template, rack session). Fix paths for jackson-databind BeanDeserializer, kotlinx-coroutines CoroutineContext, nlohmann-json json_pointer, and circe DecodingFailure.

- NL alpha 0.6 -> 0.5: equal weight semantic + BM25 (BM25 finds targets 2.3x more often than semantic among failure queries)
- Stem boost multiplier 0.5 -> 1.0: stronger file-path keyword signal
- Match ratio threshold 0.20 -> 0.10: boost files when any keyword matches, even for longer queries

NDCG@10 on 50-repo benchmark: 0.838 -> 0.851 (+0.013)
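As a rough illustration of the alpha change, the sketch below blends a semantic score and a BM25 score with a single weight. The function name `blend_scores`, the linear form, and the assumption that both inputs are already normalized to [0, 1] are hypothetical; the PR does not show the formula semble actually uses, only that alpha 0.5 gives both signals equal weight.

```python
def blend_scores(semantic: float, bm25: float, alpha: float = 0.5) -> float:
    """Linear blend (assumed form): alpha weights semantic, (1 - alpha) weights BM25."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * semantic + (1.0 - alpha) * bm25


# Made-up scores for a chunk that BM25 ranks higher than the semantic model does.
semantic_score, bm25_score = 0.2, 0.8
old = blend_scores(semantic_score, bm25_score, alpha=0.6)  # previous NL alpha
new = blend_scores(semantic_score, bm25_score, alpha=0.5)  # equal weighting
```

Under this assumed form, lowering alpha raises the blended score of chunks that BM25 favors, which matches the stated motivation that BM25 finds targets more often among failure queries.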

Add semantic/architecture/symbol categories to 212 tasks across 14 repos that were missing them. Add 11 new express tasks to restore coverage after broken annotations were removed (9 -> 20 tasks). Total: 930 tasks across 48 repos, all categorized.

- commons-lang: reflectionEquals span 89-99 -> 179-318 (class header is not the reflection logic)
- circe: auto/semiauto derivation target was Decoder.scala (wrong file), now points to generic/auto.scala + semiauto.scala
- exposed: SchemaUtils target was abstract SchemaUtilityApi.kt, now points to the concrete SchemaUtils.kt in exposed-jdbc
- sinatra: halt/pass/redirect span too narrow, use whole-file
- sinatra: Rack build() method span was setup_default_middleware helper, now points to the actual build() method at line 1670
- sinatra: Helpers symbol span extended to cover halt (1028) and pass (1036)

guzzle +5, ktor +4, sinatra +4, messagepack-csharp +3, alamofire +3, tokio +3, trpc +3, cats +3. All repos now have >= 20 tasks. Total: 954 tasks across 48 repos.

- Add curl, redis, bats-core, aeson, http-dart, telescope.nvim, lazy.nvim, zig
- 160 new annotation tasks (20 per repo)
- Add .bash, .zig, .hs file extensions to file_walker
- Overall NDCG@10: 0.841 across 56 repos

…mean-of-language-means

- Add 10 new repos: nvm, bash-it (replaces gitflow-avh), pandoc, xmonad, dio, riverpod, nvim-lspconfig, mini.nvim, zls, zig-clap
- Bring bash, haskell, dart, lua, zig all to 3+ repos
- Fix run_benchmark.py aggregation: headline NDCG@10 is now mean of per-language means (one vote per language, not per repo), which previously over-weighted Python's 9 repos
- Fix numpy float type annotation issue (float() cast on np.median)
- New headline: NDCG@10 = 0.829 across 20 languages (66 repos)
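The aggregation change can be sketched as follows. This is a minimal illustration, not the code in run_benchmark.py: the function name `mean_of_language_means`, the `(language, score)` input shape, and the scores themselves are made up for the example.

```python
from collections import defaultdict
from statistics import mean


def mean_of_language_means(repo_scores: list[tuple[str, float]]) -> float:
    """Average per-repo scores within each language, then average across languages."""
    by_language: dict[str, list[float]] = defaultdict(list)
    for language, score in repo_scores:
        by_language[language].append(score)
    return mean(mean(scores) for scores in by_language.values())


# Made-up scores: three Python repos and one Zig repo.
scores = [("python", 0.9), ("python", 0.9), ("python", 0.9), ("zig", 0.5)]
per_repo = mean(score for _, score in scores)   # one vote per repo
per_language = mean_of_language_means(scores)   # one vote per language
```

With these numbers the per-repo mean is pulled toward the language with the most repos, while the mean of language means weights both languages equally.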

… annotation audit

- Fix n_relevant to use annotation count instead of index coverage (reviewer #5)
- Add per-category NDCG@10 to printed summary and saved JSON (reviewer #7)
- Replace 11 trivially-lexical semantic queries with vocabulary-diverse alternatives
- Baseline: NDCG@10 = 0.825 (architecture=0.773, semantic=0.823, symbol=0.943)
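To show why the n_relevant fix matters, here is a minimal NDCG@k sketch. It assumes binary relevance and file-level targets, which the PR does not confirm; `ndcg_at_k` and the file names are hypothetical. The point is that the ideal DCG is built from the number of annotated targets, so a target that never made it into the index still lowers the score.

```python
import math


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k; the ideal ranking counts every annotated target."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    # Annotation count, not the number of targets that happen to be indexed.
    n_relevant = len(relevant)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(n_relevant, k)))
    return dcg / ideal if ideal > 0 else 0.0


# Two annotated targets, but only one of them is ever retrieved.
score = ndcg_at_k(["a.py", "b.py", "c.py"], {"a.py", "z.py"})
```

If n_relevant were taken from index coverage instead (only `a.py` retrievable), the same ranking would score a perfect 1.0 and hide the missing target.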

…ent-scoped one

- ktor: the server application query targeted files outside the benchmark_root; replaced with a client-side plugin pipeline query that indexes correctly
- rxswift: Observable.swift is a thin declaration file; corrected relevant target to ObservableType.swift which contains the actual protocol definition
- Swift +0.006, Kotlin +0.004, architecture category +0.002

- sinatra: fix 3 queries pointing to wrong/narrow line ranges in base.rb
- circe: replace out-of-scope generic derivation query (targets modules/generic/ which is outside benchmark_root) with DecodingFailure/ParsingFailure query targeting Error.scala in core
- cats: replace Semigroup/Monoid query pointing to kernel/ module (outside root) with MonoidK/SemigroupK query targeting core
- rxswift: add Zip+arity.swift as second relevant for zip operator query
- exposed: add Transactions.kt as second relevant for transaction block query

NDCG@10: 0.825 (baseline) -> 0.830

Remove outdated result files from previous benchmark runs and add fresh result from current HEAD (NDCG@10=0.830).

- Remove nvim-lspconfig (4th lua repo, lowest score 0.583) to keep all languages at 3 repos
- Fix bash-it and libuv annotations using non-standard 'api' and 'keyword' categories; remap to 'architecture' and 'symbol'
- Refresh benchmark results: NDCG@10 = 0.833
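A category remap like the one described above could look like the sketch below. The names `CATEGORY_REMAP`, `VALID_CATEGORIES`, and `normalize_category` are hypothetical, and the pairing ('api' to 'architecture', 'keyword' to 'symbol') is assumed from the order in which the commit message lists them.

```python
# Assumed pairing, read off the order used in the commit message.
CATEGORY_REMAP = {"api": "architecture", "keyword": "symbol"}
VALID_CATEGORIES = {"architecture", "semantic", "symbol"}


def normalize_category(category: str) -> str:
    """Map a non-standard category to a standard one and reject anything unknown."""
    normalized = CATEGORY_REMAP.get(category, category)
    if normalized not in VALID_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return normalized


remapped = [normalize_category(c) for c in ["api", "keyword", "semantic"]]
```

Rejecting unknown values rather than passing them through would surface future non-standard categories at load time instead of silently dropping them from per-category scores.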

Pringled added 20 commits April 16, 2026 12:25

Add more benchmarks

2b33d47

Add more benchmarks

2c6e4c2

Add more benchmarks

20a3328

Merge remote-tracking branch 'origin/main' into expand-benchmark

9b33742

fix: Fix cobra annotation - help.go merged into command.go

45396aa

chore: Top up 8 thin repos to 20 tasks each

8c80270

Add 8 new repos to benchmark (C, Bash, Haskell, Dart, Lua, Zig)

6c14e10

add libuv as 3rd C repo; all 20 languages now have 3+ repos

970ee37

Refresh benchmark results (drop stale, add current run)

d684944

Remove stale benchmark result file

077d940

Regenerate benchmark results after fixing api/keyword categories

3b03cac

Pringled merged commit 256b839 into main Apr 17, 2026
8 checks passed

Pringled deleted the expand-benchmark branch April 22, 2026 05:05
